Arabic Speech Analysis for Classification and Prediction of Mental Illness due to Depression Using Deep Learning



Introduction
Depression is a mental disorder, and according to the WHO, more than 300 million people (4.4%) are currently affected by it [1], with the rate continually increasing [2]. From 2005 to 2015, the occurrence of depression increased by almost 18% worldwide. Depression leads to somatic problems, mental disorders, sleep disorders, and gastrointestinal problems. Symptoms related to self-confidence and rumination appear in patients with depression [3,4]. It affects the functioning and performance of patients at school, in the family, and at work. It may also have a severe impact, leading to self-harm and sometimes suicide. Mood disorder and mental illness in adult life are also associated with depressive disorder [5,6]. People with depression may also experience a bad mood, low self-esteem, loss of interest, low energy, and body pain without a clear cause [7]. Automatic speech recognition (ASR), also known simply as speech recognition, lets a computer understand users' speech by converting spoken words into a series [8]. A speech emotion recognition system is helpful in medical practice for detecting changes in mental state and emotions. For example, when a patient has mood swings, the system can react rapidly and examine the patient's current psychological state [9]. As a result, depression prediction methods might help in designing better mental health care software and technologies such as intelligent robots.
1.1. Background. Depression rates are continually increasing, and many issues in daily life arise from this mental disorder. Unfortunately, it is difficult to predict depression from neutral speech. Machine learning is one of the most common ways to analyze data from different sources and determine how people feel and speak under depression.
Early recognition of depressive symptoms, followed by evaluation and therapy, may greatly enhance the odds of controlling the symptoms and the underlying illness and attenuate harmful consequences for health and social life. However, detecting depression is difficult and time-consuming. Current methods for predicting mental disorders rely primarily on clinical discussion and surveys conducted by a psychologist. This approach is largely based on one-on-one surveys and can generally identify depression only as a mental disorder condition. Since machine learning models are increasingly used to make essential predictions in critical situations every day, the demand for transparency from everyone in the AI industry grows. Many research projects attempt to develop an automated depression detection system [10].
Deep neural networks have lately made significant contributions to a wide range of disciplines, including pattern recognition, and have proved a better option than traditional machine learning techniques such as SVM, ANN, and HMM. Han et al. [16] proposed a DNN-ELM (extreme learning machine) based voice emotion classification system. Bertero and Fung [17] used the convolutional neural network (CNN), which has many applications in this field, to recognize voice-related emotions and reported good results. In subsequent research, RNN and LSTM (long short-term memory) were also enhanced, and GRU [18], QRNN [19], and other models were proposed for speech data. At the same time, other work attempted to integrate the CNN and RNN into a CRNN model for speech emotion recognition [20]. The 1D-CNN architecture improves the performance of the individual systems, since it was recently developed to deal with text or one-dimensional data such as human speech. However, ensemble CNN models exhibited better performance for emotion classification using speech analysis [7].
To help address these issues, we built an automated method for identifying depressive symptoms from Arabic speech analysis. The proposed automated mental illness identification technique, which works on users' concerns described in Arabic, might contribute significantly to this research area.
This study proposes a hybrid model (CNN + SVM) to classify depression from Arabic speech analysis and predict mental disorders. Additionally, the results are compared with RNN and 1-D CNN for the same problem on the same data set.

Main Contributions.
This research has the following main contributions:
(i) For the first time, a CNN + SVM-based hybrid model is proposed for Arabic speech analysis to predict mental illness due to depression, attaining approximately 92% accuracy.
(ii) A large Arabic speech benchmark data set is employed for the experiments.
(iii) Experts from both the medical and psychology fields are consulted to derive possible symptoms of depression for identifying the best features.
(iv) RNN and CNN are individually applied to the same data set to analyze and compare their results with those of the proposed hybrid model.
(v) Using our model, researchers can detect depression in spoken Arabic with an accuracy rate of approximately 92%.
Furthermore, the rest of this paper is divided into four main sections. Section 2 presents the proposed methodology. Section 3 details the experimental results with analysis. Section 4 compares the results of the proposed hybrid model with individual RNN and CNN on the same benchmark data set. Finally, Section 5 summarizes the research.

Proposed Methodology
This study is designed to predict depression from recorded Arabic speech, or while a person is speaking Arabic, using the proposed hybrid approach shown in Figure 1, and to compare it with deep learning (DL) models such as RNN and CNN.
First, we extracted the features from the speeches of both depression and nondepression groups.
CNN is a deep learning model used for pattern classification and is composed of an input layer, hidden layers, and an output layer, $F(Y, W) = X$, where $Y$ is the input, $W$ is the weight vector, $F$ is any function, and $X$ is the output. The hidden layer contains four components: the convolution layer, pooling layer, fully connected layer, and activation function [21].
(i) Convolution layer: a kernel is slid over the input vector to produce a feature map, $x_{i,j} = \sigma\left((W * Y)_{i,j} + b\right)$, where $x_{i,j}$ is the output of the convolution operator, $W$ is the kernel that is slid over the input, $Y$ is the input, $\sigma$ denotes the nonlinearity in the network, and $b$ is the bias [21][22][23].
(ii) Pooling layer: the dominant features are extracted by passing a window through a pooling function: average pooling, max-pooling, or stochastic pooling [24].
(iii) Fully connected layer: the convolution and pooling outputs are collected here, and the final dot product of the input and weight vector is computed in this layer.
(iv) Activation function: the sigmoid, also called the logistic function, takes values in [0, 1]; in a CNN, its use may cause the gradient to vanish [14]. The softmax takes a vector argument and transforms it into a vector whose elements fall in the range [0, 1]; it is appropriate when all the dependent variables are categorical, $f(z_i) = e^{z_i} / \sum_{j=1}^{n} e^{z_j}$. ReLU, $f(x) = \max(0, x)$, does not allow the gradient to vanish for values greater than zero, where it is linear [24].
(v) Support vector machine (SVM): a nonparametric supervised machine learning technique that classifies data by fitting a hyperplane to the data [25,26]. There are different types of SVM learning mechanisms; to make nonlinear data linearly separable, a kernel is fitted to the data, and the most commonly used kernel is the Gaussian [27]. The dense layer of the CNN model is used to build the hybrid approach for depression prediction.
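As an illustration of components (i), (ii), and (iv) above, the following minimal sketch applies a one-dimensional convolution, max-pooling, and the activation functions to a toy signal. The signal, kernel, and window size are hypothetical values chosen for readability, not the settings used in this study, and the convolution is written as a sliding dot product, as is common in deep learning libraries.

```python
import math

def relu(x):
    # ReLU: f(x) = max(0, x)
    return max(0.0, x)

def sigmoid(x):
    # Logistic function: f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    # Subtracting the maximum keeps exp() numerically stable.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def conv1d(signal, kernel, bias=0.0, activation=relu):
    # Slide the kernel over the input; each output is sigma(W . window + b).
    k = len(kernel)
    return [
        activation(sum(kernel[j] * signal[i + j] for j in range(k)) + bias)
        for i in range(len(signal) - k + 1)
    ]

def max_pool1d(values, window=2):
    # Keep the dominant value in each non-overlapping window.
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]

# Hypothetical toy input and kernel (illustrative only).
signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0]
kernel = [1.0, -1.0]

feature_map = conv1d(signal, kernel)   # [0.0, 0.0, 1.0, 2.0, 1.0]
pooled = max_pool1d(feature_map)       # [0.0, 2.0]
probs = softmax(pooled)                # class-like scores that sum to 1
```

In a trained network the kernel weights and bias are learned rather than fixed by hand as they are here.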
The architecture of the proposed model is explained in Table 1.
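To make the hybrid idea more concrete, the sketch below shows how a Gaussian (RBF) kernel could compare feature vectors taken from the CNN dense layer before an SVM separates them. The feature vectors and the value of `gamma` are hypothetical, and the sketch only computes the kernel matrix; it does not train an SVM and is not the configuration used in this study.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    # Gaussian kernel: K(u, v) = exp(-gamma * ||u - v||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Hypothetical dense-layer outputs for three speech samples.
dense_features = [[0.0, 1.0], [0.0, 1.0], [2.0, 1.0]]

# Kernel (Gram) matrix: similarity of every sample to every other sample.
gram = [[rbf_kernel(u, v) for v in dense_features] for u in dense_features]
```

Identical feature vectors get a similarity of 1, and the similarity decays toward 0 as the vectors move apart, which is what lets a kernel SVM draw a nonlinear boundary in the original feature space.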

Recurrent Neural Network (RNN).
RNN is normally used to analyze sequential data (e.g., speech, text); just like other neural networks, it contains input, hidden, and output layers [28]. The hidden layer, called the recurrent layer, keeps the same parameters across the following steps while its memory keeps updating, $h(t) = f(Wx(t) + Uh(t-1))$, where $W$ and $U$ are weight matrices, $x(t)$ is the input vector, $h(t-1)$ is the correlated (previous) hidden state, and $f$ represents the nonlinear activation function [28][29][30]. Different activation functions are used in the hidden layer. The most commonly used are the sigmoid function, $f(x) = 1/(1 + e^{-x})$ [29], and the tanh function, with range $(-1, 1)$ [28]. In the output layer, the softmax function is used for the final output [28,29]. The architecture of the RNN is explained in Figure 2. The proposed hybrid approach and the individual CNN and RNN are applied to diagnose depression while speaking Arabic.
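The recurrence $h(t) = f(Wx(t) + Uh(t-1))$ can be sketched in a few lines. The weights, inputs, and hidden size below are hypothetical, tanh is assumed for $f$, and the initial hidden state is assumed to be zero; this is an illustration of the update rule, not the network trained in this study.

```python
import math

def matvec(matrix, vector):
    # Plain matrix-vector product.
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def rnn_step(W, U, x_t, h_prev):
    # h(t) = tanh(W x(t) + U h(t-1))
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [math.tanh(a + b) for a, b in zip(wx, uh)]

def run_rnn(W, U, inputs, hidden_size):
    # The same W and U are reused at every time step.
    h = [0.0] * hidden_size  # assumed zero initial state
    states = []
    for x_t in inputs:
        h = rnn_step(W, U, x_t, h)
        states.append(h)
    return states

# Hypothetical weights and a three-step input sequence.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.1, 0.0], [0.0, 0.1]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(W, U, inputs, hidden_size=2)
```

Each hidden state depends on the current input and on the previous state, so information from earlier frames of a speech signal can influence later predictions.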
The training-testing criteria are adopted in the analysis of 200 speeches. A total of 70% of the data (140 speeches) is used for training, and 30% (60 speeches) is used for testing. The training data is used to train the CNN + SVM, RNN, and CNN models, and the test data is used to check the validity of all models and the prediction rate of the trained models. The accuracy, area under the curve (AUC), sensitivity, specificity, false-positive rate (FPR), and false-negative rate (FNR) are calculated to observe the models' performance in depth.
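The confusion-matrix measures named above can be sketched as follows. These are the standard textbook definitions, with depression assumed to be the positive class; the paper's own equations are not reproduced here and may use a different convention. The counts passed to the function are hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    # tp/fn: depression speeches predicted correctly/incorrectly
    # tn/fp: nondepression speeches predicted correctly/incorrectly
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "fpr": fp / (fp + tn),           # equals 1 - specificity
        "fnr": fn / (fn + tp),           # equals 1 - sensitivity
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, fn=5, fp=10, tn=40)
```

Under these definitions, the FPR and FNR are the complements of specificity and sensitivity, which is a useful consistency check when reading reported tables.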

Experimental Results and Performance Analysis
Using Arabic speech analysis, the study predicts depression disorder and compares the proposed model with DL models such as RNN and CNN. Of the data, 70% is used for training and 30% for testing.
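A 70/30 split of the 200 speeches described in the methodology could be sketched as below. The shuffling, the seed, and the use of integer identifiers are assumptions made for illustration; the paper does not state how the speeches were assigned to each part.

```python
import random

def train_test_split(items, train_fraction=0.7, seed=0):
    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical identifiers standing in for the 200 speech recordings.
speech_ids = list(range(200))
train_ids, test_ids = train_test_split(speech_ids)
```

With 200 items and a training fraction of 0.7, this yields 140 training and 60 testing speeches, matching the sizes reported in the paper.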

Data Description.
In this study, we used the Basic Arabic Vocal Emotions Dataset (BAVED), composed of Arabic words spoken at different levels of emotion and recorded in an audio format (https://www.kaggle.com/a13x10/basic-arabicvocal-emotions-dataset). In the experiments, we included seven words: 0 for "like," 1 for "unlike," 2 for "this," 3 for "file," 4 for "good," 5 for "neutral," and 6 for "bad." The seven words are further classified according to their emotional intensity: 0 denotes low emotion, including tired or weary; 1 denotes neutral emotion; and 2 denotes strong emotion of happiness, joy, sadness, and anger. The categories labelled as 0 and 1 are for low and neutral emotions that represent nondepression (sadness) and negative emotions (anger).
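The word and intensity codes described above can be captured in two lookup tables. The dictionary and function names are hypothetical; only the code-to-label pairs come from the data description, and no depression label is assigned here because that mapping is not encoded by these codes alone.

```python
# Word codes as listed in the data description.
WORDS = {
    0: "like", 1: "unlike", 2: "this", 3: "file",
    4: "good", 5: "neutral", 6: "bad",
}

# Emotional intensity codes as listed in the data description.
EMOTION_LEVELS = {0: "low", 1: "neutral", 2: "strong"}

def describe(word_code, level_code):
    # Human-readable summary of one recording's codes.
    return f"{WORDS[word_code]} ({EMOTION_LEVELS[level_code]} emotion)"

example = describe(4, 0)
```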

Hybrid Model Performance.
First, we applied the proposed hybrid model to the data. It attained a 90% accuracy rate in classifying depression while speaking on the training part and a 91.60% accuracy rate in predicting depression on the testing part. A bar chart of the accuracy of the CNN + SVM model on the train and test data is presented in Figure 3. The red color presents the accuracy on the training data, and the blue color presents the accuracy on the testing data.
The correctly classified speeches are presented in the diagonal values, and the off-diagonal values show the incorrectly predicted speeches. The hybrid model accurately predicted a total of 126 speeches (depression = 68, nondepression = 58) and incorrectly predicted 14 speeches for the training data set. Similarly, the hybrid model accurately predicted 55 speeches (depression = 31, nondepression = 24) and incorrectly predicted 5 speeches for the test data set. Figure 4 presents the confusion matrix results of the hybrid model on the train and test data.

Individual RNN and CNN Models Performance.
RNN and CNN were individually applied to the data. The RNN achieved an 80.70% accuracy rate in predicting depression while speaking on the training part and an 81.60% accuracy rate on the testing part. Similarly, the CNN attained an 88.5% accuracy rate on the training part and an 86.60% accuracy rate on the testing part. The accuracies attained in the training and testing stages of the RNN and CNN models are exhibited in Figure 5. The red color presents the accuracy on the training data, and the blue color presents the accuracy on the testing data. The training and testing loss and accuracy measured for the RNN and CNN models are plotted against the 25 epochs in Figure 6. The blue and red solid lines represent the accuracies of the RNN and CNN models for the train and test data. The dotted blue and red lines present the losses of the RNN and CNN models with respect to the training and testing data. It is observed that the network loss is initially higher, but as the epochs increase, the loss shows a decreasing trend in all models [32]. The confusion matrix results of the RNN and CNN models on the train and test data are presented in Figure 7. The correctly classified speeches are presented in the diagonal values, and the off-diagonal values show the incorrectly classified speeches. The RNN model accurately predicted a total of 113 speeches (depression = 69, nondepression = 44) and incorrectly predicted 27 speeches for the training data set. Likewise, the RNN model accurately predicted a total of 49 speeches (depression = 31, nondepression = 18) and incorrectly predicted 11 speeches for the testing data set. On the other hand, the CNN model accurately predicted a total of 124 speeches (depression = 66, nondepression = 58) and incorrectly predicted 16 speeches on the train data set. Correspondingly, the CNN model accurately predicted a total of 52 speeches (depression = 29, nondepression = 23) and incorrectly predicted 8 speeches on the test data set.
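The reported accuracy rates can be cross-checked against the counts of correctly and incorrectly predicted speeches given in the text. The short sketch below recomputes accuracy as correct / (correct + incorrect); the dictionary layout and variable names are illustrative.

```python
# (correct, incorrect) speech counts as reported in the text.
reported = {
    ("CNN+SVM", "train"): (126, 14),
    ("CNN+SVM", "test"): (55, 5),
    ("RNN", "train"): (113, 27),
    ("RNN", "test"): (49, 11),
    ("CNN", "train"): (124, 16),
    ("CNN", "test"): (52, 8),
}

# Accuracy in percent for every model and data part.
accuracy = {
    key: 100.0 * correct / (correct + incorrect)
    for key, (correct, incorrect) in reported.items()
}
```

The recomputed values agree with the reported percentages up to rounding or truncation in the last digit; for example, 55 of 60 test speeches corresponds to about 91.67%.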

Sensitivity Analysis.
The models are assessed with sensitivity, specificity, FPR, and FNR for both the train and test data, as given in Table 2. Sensitivity and specificity measure how often a model correctly identifies depression and nondepression speech, respectively, when the speech truly belongs to that class. The FPR and FNR are the probabilities that a model predicts depression when the speech belongs to nondepression and predicts nondepression when the speech belongs to depression, respectively [33]. For the training data set, the RNN model achieved 100%, 61.9%, 0.0, and 0.380 for sensitivity, specificity, FPR, and FNR, respectively. Similarly, for the testing data set, it achieved 100%, 62%, 0.0, and 0.379 for sensitivity, specificity, FPR, and FNR, respectively. The CNN model achieved 95.6%, 81.6%, 0.043, and 0.183 for sensitivity, specificity, FPR, and FNR, respectively, for the training data set. Similarly, 93.5%, 79.3%, 0.064, and 0.206 were attained for sensitivity, specificity, FPR, and FNR, respectively, for the testing data set. The proposed hybrid model achieved 98.5%, 81.6%, 0.014, and 0.181 for sensitivity, specificity, FPR, and FNR, respectively, for the training data set. Similarly, for the testing data set, 100%, 82.7%, 0.0, and 0.172 were attained for sensitivity, specificity, FPR, and FNR, respectively. Performance was also measured by calculating precision, recall, and F1-score. The hybrid model achieved higher precision, recall, and F1-score than the individual RNN and CNN. The precision, recall, and F1-score values of the proposed hybrid model were 0.983, 0.816, and 0.892, respectively, for the training data. Similarly, values of 1, 0.827, and 0.905 were achieved for precision, recall, and F1-score, respectively, for the testing data, as presented in Table 3.
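The F1-score is the harmonic mean of precision and recall, so the F1 values reported for the hybrid model can be checked directly from the reported precision and recall. The function name below is illustrative.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for the hybrid model.
f1_train = f1_score(0.983, 0.816)
f1_test = f1_score(1.0, 0.827)
```

Rounded to three decimals, these give 0.892 for the training data and 0.905 for the testing data, in agreement with the reported F1-scores.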

ROC Curve Analysis.
The ROC curve is used to plot the sensitivity and specificity for the training and testing data. Area under the ROC curve (AUC) values of 0.70-0.80, >0.80, and >0.90 are considered acceptable, excellent, and rarely observed, respectively [34]. The ROC curves with AUC of the RNN, CNN, and CNN + SVM models based on speech analysis are shown in Figure 8. The hybrid approach provided the minimum FPR and FNR and a higher sensitivity and specificity rate than the RNN and CNN models in predicting depression in the Arabic language.
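One way to compute the AUC without plotting the curve is to use its rank interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below applies this to hypothetical labels and scores; it is not the procedure or data used to produce Figure 8.

```python
def roc_auc(labels, scores):
    # Compare every positive score with every negative score;
    # a tie counts as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = depression) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = roc_auc(labels, scores)
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9, which would fall in the "excellent" band quoted above.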

Discussion and Comparisons.
The study is designed to predict depression from speech, or while a person is speaking Arabic, using the proposed hybrid approach and to compare it with deep learning (DL) models such as RNN and CNN. All approaches are used to diagnose depression while speaking in the Arabic language. The training-testing approach is adopted in our analysis: 70% of the data is used for training, and 30% is used for testing. The CNN + SVM correctly predicts depression while speaking with 90.0% and 91.60% accuracy in training and testing, respectively. Overall, the hybrid approach (CNN + SVM) provided better results than RNN and CNN on the same data set. The CNN + SVM provides better accuracy than the individual approaches on speech data [35]. The RNN correctly predicts depression while speaking with 80.70% and 81.60% accuracy in training and testing, respectively. Comparably, the CNN correctly predicts depression while speaking with 88.50% and 86.60% accuracy in the training and testing stages. The proposed hybrid model predicted 126 speeches correctly and 14 speeches incorrectly for the training data set, and it predicted 55 speeches correctly and 5 speeches incorrectly for the testing data set. The RNN model predicted 113 speeches correctly and mispredicted 27 for the training data set. Similarly, for the testing data set, it predicted 49 speeches correctly and 11 incorrectly. The CNN model predicted 124 speeches correctly and mispredicted 16 for the

Conclusion
This paper has presented a hybrid model to classify depression for mental illness prediction from Arabic speech analysis. Additionally, for the same task, two deep learning models, RNN and CNN, are applied individually to the same benchmark database to analyze and compare the results using standard training-testing criteria. The proposed hybrid model correctly predicted depression while speaking with 90.0% and 91.60% accuracy on the train and test data, respectively. The RNN correctly predicted depression while speaking with 80.70% and 81.60% accuracy in training and testing, respectively. The CNN correctly predicted depression while speaking with 88.50% and 86.60% accuracy in training and testing, respectively. Overall, the hybrid approach provided better results than RNN and CNN on the same benchmark database.
Moreover, the hybrid approach produced the minimum FPR and FNR. It provided a higher sensitivity and specificity rate than the RNN and CNN models in predicting depression in the Arabic language. These research findings will be helpful for detecting depression in Arabic speech or while a person is speaking. Therefore, doctors, psychiatrists, or psychologists can use our approaches in healthcare applications to detect depression while speaking. Doctors could also use the proposed approach to separate depressed speech from neutral or normal speech. Using our model, researchers can detect depression in spoken Arabic with an accuracy rate of approximately 92%. The proposed model could be used as a tool in the voice recognition field to detect depression while speaking the Arabic language. Depressed